Gradient based training method for a support vector machine

ABSTRACT

A training method for a support vector machine, including executing an iterative process on a training set of data to determine parameters defining the machine, the iterative process being executed on the basis of a differentiable form of a primal optimisation problem for the parameters, the problem being defined on the basis of the parameters and the data set.

[0001] The present invention relates to a training method for a support vector machine.

[0002] Computer systems can be configured as learning machines that are able to analyse data and adapt in response to analysis of the data, and also be trained on the basis of a known data set. Support Vector Machines ("SVMs"), for instance, execute a supervised learning method for data classification and regression. Supervised methods refer to tasks in which a machine is presented with historical data with known labels, i.e. good customers vs bad customers, and then the machine is trained to look for patterns in the data. SVMs represent a recent development in "neural network" algorithms and have become increasingly popular over the past few years. Essentially these machines seek to define a decision surface which gives the largest margin or separation between the data classes whilst at the same time minimising the number of errors. This is usually accomplished by solving a specific quadratic optimisation problem.

[0003] In the simplest linear version, the output of the SVM is given by the linear function

y=w·x+βb  (1)

[0004] or its binarised form

y=sgn(w·x+βb)  (2)

[0005] where the vector w defines the decision surface, x is the input data, y is the classification, β is a constant that acts as a switch between the homogeneous (β=0) and the non-homogeneous (β=1) case, b is a free parameter usually called bias and "sgn" denotes the ordinary signum function, i.e. sgn(ξ)=1 for ξ>0, sgn(ξ)=−1 for ξ<0 and sgn(0)=0. Typically, the first of these two forms is used in regression (more precisely, the so-called ε-insensitive regression), and the other in classification tasks. The problem is in fact more subtle than this because training the machine ordinarily involves searching for a surface in a very high dimensional space, and possibly infinite dimensional space. The search in such a high dimensional space is achieved by replacing the regular dot product in the above expression with a nonlinear version. The nonlinear dot product is referred to as the Mercer kernel and SVMs are sometimes referred to as kernel machines. Both are described in V. Vapnik, Statistical Learning Theory, J. Wiley, 1998, ("Vapnik"); C. Burges, A Tutorial on Support Vector Machines for Pattern Recognition, Data Mining and Knowledge Discovery, 2, 1998, ("Burges"); V. Cherkassky and F. Mulier, Learning From Data, John Wiley and Sons, Inc., 1998; and N. Cristianini and J. Shawe-Taylor, An Introduction to Support Vector Machines and other Kernel-Based Learning Methods, Cambridge University Press, Cambridge, 2000.

[0006] Most solutions for the optimisation problem that are required to train the SVMs are complex and computationally inefficient. A number of existing training methods involve moving the optimisation problem to another domain to remove a number of constraints on the problem. This gives rise to a dual problem which can be operated on instead of the primal problem, and currently the fastest training methods operate on the dual problem. It is desired, however, to provide a training method which is more efficient and alleviates difficulties associated with operating on the dual problem, or at least provides a useful alternative.

[0007] The present invention relates to a training method for a support vector machine, including executing an iterative process on a training set of data to determine parameters defining said machine, said iterative process being executed on the basis of a differentiable form of a primal optimisation problem for said parameters, said problem being defined on the basis of said parameters and said data set.

[0008] Advantageously, the training method can be adapted for generation of a kernel support vector machine and of regularisation networks.

[0009] The usage of a differentiable form of the optimisation problem is particularly significant as it virtually removes the explicit checking of constraints associated with an error penalty function.

[0010] Preferably, in the case of classification, and for the SVM,

y=sgn(w·x+βb),

[0011] where y is the output, x is the input data, β is 0 or 1, and the vector w and bias b defining a decision surface are obtained as the argument minimising the following differentiable objective function:
$\Psi(w,b) = \frac{1}{2} w \cdot w + C \sum_{i=1}^{n} L\left(1 - y_i(w \cdot x_i + \beta b)\right)$

[0012] where C>0 is a free parameter, x_(i), i=1, . . . , n, being the data points, y_(i)=±1, i=1, . . . , n, being the known labels, n being the number of data points and L being a differentiable loss function such that L(ξ)=0 for ξ≦0. The said iterative process preferably operates on a derivative of the objective function Ψ until the vectors converge to a vector w defining the machine.

[0013] Preferably, for ε-insensitive regression, the differentiable form of the optimisation problem is given as minimisation of the functional
$\Psi(w,b) = \frac{1}{2} w \cdot w + C \sum_{i=1}^{n} L\left(|y_i - w \cdot x_i - \beta b| - \varepsilon\right)$

[0014] where ε>0 is a free parameter.

[0015] The present invention also provides a support vector machine for a classification task having an output y given by
$y = y(x) = \sum_{i=1}^{n} y_i \alpha_i k(x_i, x) + \beta b$

[0016] where xεR^(m) is a data point to be classified and x_(i) are training data points, k is a Mercer kernel function as described in Vapnik and Burges, and α_(i) are coefficients determined by

α_(i)=CL′(1−y_(i)η_(i)−βb)

[0017] where L′(ξ) is the derivative of the loss and the values η_(i) are determined by iteratively executing
$\eta_j^{t+1} = \eta_j^t - \delta\left(\eta_j^t - C \sum_{i=1}^{n} L'\left(1 - y_i \eta_i^t - y_i \beta b^t\right) y_i k(x_i, x_j)\right),$
$b^{t+1} = \beta b^t + \delta \beta C \sum_{i=1}^{n} L'\left(1 - y_i \eta_i^t - y_i \beta b^t\right) y_i.$

[0018] where δ>0 is a free parameter (a learning rate) and/or, in the homogeneous case (β=0), by iteratively executing:
$\eta_j^{t+1} = C \sum_{i=1}^{n} L'\left(1 - y_i \eta_i^t\right) y_i k(x_i, x_j).$

[0019] where i, j=1, . . . , n, n being the number of data points, t represents an iteration and L′ is the derivative of the loss function L.

[0020] The present invention also provides a support vector machine for ε-regression having output y given by
$y(x) = \sum_{i=1}^{n} \beta_i k(x, x_i) + \beta b$

[0021] where xεR^(m) is a data point to be evaluated and x_(i) are training data points, k is the Mercer kernel function, β is 0 or 1, and β_(i) and bias b are coefficients determined by

β_(i)=CL′(|y_(i)−η_(i)−βb|−ε)sgn(y_(i)−η_(i)−βb)

[0022] where ε is a free parameter and the values η_(i) and b are determined by iteratively executing
$\eta_j^{t+1} = \eta_j^t - \delta\left(\eta_j^t - C \sum_{i=1}^{n} L'\left(|y_i - \eta_i^t - \beta b| - \varepsilon\right) \operatorname{sgn}\left(y_i - \eta_i^t - \beta b\right) k(x_i, x_j)\right),$
$b^{t+1} = b^t + \delta \beta C \sum_{i=1}^{n} L'\left(|y_i - \eta_i^t - \beta b| - \varepsilon\right) \operatorname{sgn}\left(y_i - \eta_i^t - \beta b\right)$

[0023] where δ>0 is a free parameter (learning rate) and/or, in the homogeneous case (β=0), by iteratively executing:
$\eta_j^{t+1} = C \sum_{i=1}^{n} L'\left(|y_i - \eta_i^t| - \varepsilon\right) \operatorname{sgn}\left(y_i - \eta_i^t\right) k(x_i, x_j).$

[0024] where i, j=1, . . . , n, n being the number of data points and t represents an iteration.

[0025] Preferred embodiments of the present invention are hereinafter described, by way of example only, with reference to the accompanying drawings, wherein:

[0026] FIG. 1 is a block diagram of a preferred embodiment of a support vector machine;

[0027] FIG. 2 is a graph illustrating an optimal hyperplane established by the support vector machine for linear classification;

[0028] FIG. 3 is a graph of a hypersurface established by the support vector machine for a non-linear classification;

[0029] FIG. 4 is a graph of a regression function established by the support vector machine for linear regression;

[0030] FIG. 5 is a graph of a regression function established by a support vector machine for non-linear regression;

[0031] FIG. 6 is a graph of differentiable loss functions for classification and regression for the support vector machine; and

[0032] FIG. 7 is a graph of differentiable loss functions for regularisation networks established by the support vector machine.

[0033] A Support Vector Machine (SVM) is implemented by a computer system 2 which executes data analysis using a supervised learning method for the machine. The computer system 2 of the Support Vector Machine includes a processing unit 6 connected to at least one data input device 4, and at least one output device 8, such as a display screen. The input device 4 may include such data input devices as a keyboard, mouse, disk drive etc for inputting data on which the processing unit 6 can operate. The processing unit 6 includes a processor 10 with access to data memory 12, such as RAM and hard disk drives, that can be used to store the computer programs or software 14 that control the operations executed by the processor 10. The software 14 is executed by the computer system 2. The processing steps of the SVM are normally executed by the dedicated computer program or software 14 stored on a standard computer system 2, but can be executed by dedicated hardware circuits, such as ASICs. The computer system 2 and its software components may also be distributed over a communications network. The computer system 2 may be a UNIX workstation or a standard personal computer with sufficient processing capacity to execute the data processing steps described herein.

[0034] The primal problem for an SVM is discussed in Vapnik. In the case of classification the exact form of the problem is as follows.

[0035] Given labelled training data (x₁,y₁), . . . , (x_(n),y_(n)), with x_(i)εR^(m) and y_(i)ε{−1,1}, the primal problem is to minimise
$\Psi(w) = \frac{1}{2} w \cdot w + C \sum_{i=1}^{n} \tilde{L}(\xi_i)$  (3)
subject to
$y_i(w \cdot x_i) \geq 1 - \xi_i \quad \text{and} \quad \xi_i \geq 0, \quad i = 1, \ldots, n.$  (4)

[0036] Here $\tilde{L}$ is a convex loss function; the ξ_(i)s represent errors and are often referred to as slack variables, and C>0 is a free parameter. The typical examples of loss function are of the form $\tilde{L}(\xi) = \xi^p$, where p≧1.

[0037] The first term on the right hand side of equation (3) controls the margin 20 between the data classes 22 and 24, as shown in FIG. 2, while the second term describes the error penalty. The primal problem is an example of a constrained quadratic minimisation problem. A common approach when dealing with constraints is to use the method of Lagrange multipliers. This technique typically simplifies the form of constraints and makes the problem more tractable.

[0038] Currently the fastest available training methods for the SVM operate on a dual problem for the case of linear loss (p=1), with inherent complexity and efficiency problems.

[0039] To alleviate these difficulties, the inventors have developed a training method which solves the primal problem directly. To achieve this it has been determined that the optimisation task (3) and (4) can be rewritten as minimisation of the objective function
$\Psi(w) = \frac{1}{2} w \cdot w + C \sum_{i=1}^{n} L\left(1 - y_i w \cdot x_i\right)$  (5)

[0040] where the (modified loss) $L(\chi) = \tilde{L}(\max(0, \chi))$ is obtained after a direct substitution for the slack variable ξ_(i)=max(0, 1−y_(i)w·x_(i)), for i=1, 2, . . . , n. The modified loss L(χ) is assumed to be 0 for χ≦0. In this form the constraints (4) do not explicitly appear and so, as long as equation (5) is differentiable, standard techniques for finding the minimum of an unconstrained function may be applied. This holds if the loss function L is differentiable, in particular for L(χ)=max(0, χ)^(p) for p>1. For non-differentiable cases, such as the linear loss function L(χ)=max(0, χ), a simple smoothing technique can be employed, e.g. a Huber loss function could be used, as discussed in Vapnik. The objective function is also referred to as a regularised risk.
$L(\xi) = \begin{cases} 0 & \text{for } \xi \leq 0, \\ \xi^2/(4\delta) & \text{for } 0 < \xi \leq 2\delta, \\ \xi - \delta & \text{otherwise.} \end{cases}$
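A minimal sketch in Python of such a smoothed (Huber-style) hinge loss and its derivative, as used by the iterations below; the function names huber_hinge and huber_hinge_deriv and the smoothing parameter delta are illustrative only.

```python
import numpy as np

def huber_hinge(xi, delta=0.5):
    """Smoothed hinge loss L(xi): 0 for xi <= 0, quadratic on (0, 2*delta], linear beyond."""
    xi = np.asarray(xi, dtype=float)
    return np.where(xi <= 0, 0.0,
           np.where(xi <= 2 * delta, xi ** 2 / (4 * delta), xi - delta))

def huber_hinge_deriv(xi, delta=0.5):
    """Derivative L'(xi) of the smoothed hinge loss above."""
    xi = np.asarray(xi, dtype=float)
    return np.where(xi <= 0, 0.0,
           np.where(xi <= 2 * delta, xi / (2 * delta), 1.0))
```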

[0041] Two methods for minimising equation (5) are given below. They are derived from the explicit expression for the gradient of the function:
$\nabla_w \Psi = w - C \sum_{i=1}^{n} y_i x_i L'\left(1 - y_i(x_i \cdot w + \beta b)\right),$  (6)
$\nabla_b \Psi = -C \beta \sum_{i=1}^{n} y_i L'\left(1 - y_i(x_i \cdot w + \beta b)\right).$

[0042] The first method executes a gradient descent technique to obtain the vector w iteratively using the following:
$w^{t+1} = w^t - \delta \nabla_w \Psi = w^t - \delta\left(w^t - C \sum_{i=1}^{n} L'\left(1 - y_i(w^t \cdot x_i + \beta b^t)\right) y_i x_i\right),$  (7)
$b^{t+1} = b^t - \delta \nabla_b \Psi = b^t + \delta \beta C \sum_{i=1}^{n} L'\left(1 - y_i(w^t \cdot x_i + \beta b^t)\right) y_i$

[0043] where δ controls the step size and t represents the "time" or iteration step. The value of the parameter δ can be either fixed or can be made to decrease gradually. One robust solution for p=2 is to use δ calculated by the formula:
$\delta = \frac{\|\nabla_w \Psi(w^t, b^t)\|^2 + \left(\nabla_b \Psi(w^t, b^t)\right)^2}{\|\nabla_w \Psi(w^t, b^t)\|^2 + 2C \sum\limits_{i:\, y_i(w \cdot x_i + \beta b^t) < 1} \left(x_i \cdot \nabla_w \Psi(w^t, b^t) + \nabla_b \Psi(w^t, b^t)\right)^2}$

[0044] where ∇_(w)Ψ and ∇_(b)Ψ are calculated from (6), which for p=2 simplify to
$\nabla_w \Psi = w - 2C \sum_i y_i\left(1 - y_i w \cdot x_i - y_i \beta b\right) x_i,$
$\nabla_b \Psi = -2C \beta \sum_i y_i\left(1 - y_i w \cdot x_i - y_i \beta b\right)$

[0045] with summation taken over all indices i such that 1−y_(i)(w·x_(i)+βb)>0.
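A Python sketch of this linear primal training: the gradient descent updates of equations (6) and (7) using the smoothed loss derivative huber_hinge_deriv defined above, together with the p=2 step-size rule just described. The function names, the fixed default step and the iteration cap are illustrative assumptions, and the adaptive step is shown as an optional helper since it is derived for the quadratic loss.

```python
import numpy as np

def optimal_step_p2(X, y, w, b, C=1.0, beta=1):
    """Step size delta for the quadratic (p=2) loss, following the formula above."""
    margins = 1.0 - y * (X @ w + beta * b)
    act = margins > 0                                    # indices with y_i(w.x_i + beta*b) < 1
    grad_w = w - 2 * C * (X.T @ (y * margins * act))     # simplified gradient for p=2
    grad_b = -2 * C * beta * np.sum(y * margins * act)
    num = grad_w @ grad_w + grad_b ** 2
    proj = X[act] @ grad_w + grad_b                      # x_i . grad_w(Psi) + grad_b(Psi), active set
    den = grad_w @ grad_w + 2 * C * np.sum(proj ** 2)
    return num / den if den > 0 else 1e-3                # fall back to a small fixed step

def train_linear_svm_gd(X, y, C=1.0, beta=1, delta_step=0.01, n_iter=1000, tol=1e-6):
    """Primal gradient descent for a linear SVM classifier, equations (6) and (7).
    X: (n, m) data matrix, y: (n,) labels in {-1, +1}."""
    n, m = X.shape
    w, b = np.zeros(m), 0.0
    for _ in range(n_iter):
        margins = 1.0 - y * (X @ w + beta * b)           # 1 - y_i(w.x_i + beta*b)
        lp = huber_hinge_deriv(margins)                  # L'(...)
        grad_w = w - C * (X.T @ (lp * y))                # equation (6)
        grad_b = -C * beta * np.sum(lp * y)
        # For the p=2 loss, delta_step could instead be optimal_step_p2(X, y, w, b, C, beta).
        w_new, b_new = w - delta_step * grad_w, b - delta_step * grad_b   # equation (7)
        if np.linalg.norm(w_new - w) + abs(b_new - b) < tol:
            return w_new, b_new
        w, b = w_new, b_new
    return w, b
```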

[0046] The second method, valid in the homogeneous case of β=0, is a fixed point technique which involves simply setting the gradient of equation (6) to zero, and again solving for the vectors w iteratively. Accordingly, with ∇_(w)Ψ=0 this allows the minimum of equation (5) to be found using:
$w^{t+1} = C \sum_{i=1}^{n} L'\left(1 - y_i w^t \cdot x_i\right) y_i x_i.$  (8)

[0047] The iterative training process of equation (8) can, in some instances, fail to converge to a set of vectors, but when it does converge it does so very rapidly. The training process of equation (7) is not as rapid as that of equation (8), but it will always converge provided δ is sufficiently small. The two processes can be executed in parallel to ensure convergence to a set of vectors for an SVM.
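A corresponding Python sketch of the fixed-point iteration of equation (8) for the homogeneous case (β=0), again assuming the huber_hinge_deriv sketch above; as noted above, this iteration is not guaranteed to converge, so the iteration cap is an illustrative safeguard.

```python
import numpy as np

def train_linear_svm_fixed_point(X, y, C=1.0, n_iter=200, tol=1e-6):
    """Fixed-point iteration of equation (8), homogeneous case (beta = 0)."""
    n, m = X.shape
    w = np.zeros(m)
    for _ in range(n_iter):
        lp = huber_hinge_deriv(1.0 - y * (X @ w))   # L'(1 - y_i w.x_i)
        w_new = C * (X.T @ (lp * y))                # equation (8)
        if np.linalg.norm(w_new - w) < tol:
            return w_new
        w = w_new
    return w
```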

[0048] The training processes of equations (7) and (8) involve searching for "separating" hyperplanes in the original input space of actual m-dimensional observations x_(i), such as the optimal hyperplane 26 shown in FIG. 2, where ξ=1−y_(i)(w·x_(i)+βb). This approach can be extended to search for a hyperplane in a high dimensional or even infinite dimensional space of feature vectors. This hyperplane corresponds to a non-linear surface in the original space, such as the optimal hypersurface 30 shown in FIG. 3.

[0049] In many situations of practical interest the data vectors x_(i)εR^(m) live in a very high dimensional space, m>>1, or possibly m=∞. However, often they can be parameterised by lower dimensional observation vectors $\tilde{x}_i \in R^{\tilde{m}}$, $x_i = \Phi(\tilde{x}_i)$, with the property that the dot products can be calculated by an evaluation of a Mercer kernel function k, i.e.:

$x_i \cdot x_j = \Phi(\tilde{x}_i) \cdot \Phi(\tilde{x}_j) = k(\tilde{x}_i, \tilde{x}_j).$  (9)

[0050] The Mercer kernel function is discussed in Vapnik and Burges. Vectors $\tilde{x}_i$ are actual observations, while 'feature' vectors x_(i) are conceptual, but not directly observable, in this context. In such a case, the vector w determining the optimal hyperplane in the feature space cannot be practically represented explicitly by a computer system. The way around this obstacle is to use the "data expansion" of the optimal solution
$w = \sum_{i=1}^{n} y_i \alpha_i x_i = \sum_{i=1}^{n} y_i \alpha_i \Phi(\tilde{x}_i).$  (10)

[0051] where α_(i)≧0 (referred to as Lagrange multipliers). The optimal SVM is uniquely determined by those coefficients, because for any vector $\tilde{x} \in R^{\tilde{m}}$,
$y(\tilde{x}) = w \cdot \Phi(\tilde{x}) = \sum_{i=1}^{n} y_i \alpha_i k(\tilde{x}_i, \tilde{x})$

[0052] Taking advantage of this property, the above training processes are reformulated as follows. Instead of searching for w directly, the dot products $w \cdot x_i = w \cdot \Phi(\tilde{x}_i)$ for i=1, 2, . . . , n are searched for, and are found by taking the dot product on both sides of equations (7) and (8), respectively. In the case of the gradient descent method, this gives rise to:
$w^{t+1} \cdot x_j = w^t \cdot x_j - \delta\left(w^t \cdot x_j - C \sum_{i=1}^{n} L'\left(1 - y_i w^t \cdot x_i - y_i \beta b^t\right) y_i\, x_i \cdot x_j\right)$  (11)

[0053] leading to the "non-linear" version of the gradient descent process being
$\eta_j^{t+1} = \eta_j^t - \delta\left(\eta_j^t - C \sum_{i=1}^{n} L'\left(1 - y_i \eta_i^t\right) y_i k(\tilde{x}_i, \tilde{x}_j)\right)$  (12)

[0054] where $\eta_j^t = w^t \cdot x_j$ and $\eta_j^{t+1} = w^{t+1} \cdot x_j$, and δ>0 is a free parameter.

[0055] Similarly, the non-linear version of the fixed-point process (for β=0) is given by:
$\eta_j^{t+1} = C \sum_{i=1}^{n} L'\left(1 - y_i \eta_i^t\right) y_i k(\tilde{x}_i, \tilde{x}_j).$  (13)
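A Python sketch of the kernel (non-linear) iterations (12) and (13), operating on the vector of values η_j = w·Φ(x̃_j) through a precomputed kernel matrix. The Gaussian (RBF) kernel and its width are illustrative assumptions, and huber_hinge_deriv is the smoothed loss derivative sketched earlier.

```python
import numpy as np

def rbf_kernel_matrix(X, gamma=1.0):
    """Illustrative Mercer kernel: Gaussian/RBF kernel matrix K[i, j] = k(x_i, x_j)."""
    sq = np.sum(X ** 2, axis=1)
    return np.exp(-gamma * (sq[:, None] + sq[None, :] - 2 * X @ X.T))

def train_kernel_svm(X, y, C=1.0, delta_step=0.01, n_iter=500, fixed_point=False, gamma=1.0):
    """Kernel classification training: equation (12) (gradient descent) or (13) (fixed point, beta=0)."""
    K = rbf_kernel_matrix(X, gamma)
    n = len(y)
    eta = np.zeros(n)                                 # eta_j = w . Phi(x_j)
    for _ in range(n_iter):
        lp = huber_hinge_deriv(1.0 - y * eta)         # L'(1 - y_i eta_i)
        target = C * (K @ (lp * y))                   # C * sum_i L'(...) y_i k(x_i, x_j)
        eta = target if fixed_point else eta - delta_step * (eta - target)
    return eta, K
```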

[0056] Having used the iterative processes defined by equations (12) and (13) to find the optimal values η_(j) (j=1, . . . , n) and bias b, the coefficients α_(i) defined in equation (10) need to be determined. One approach is to solve the system of equations
$\eta_j = \sum_{i=1}^{n} y_i \alpha_i k(\tilde{x}_i, \tilde{x}_j) \quad (j = 1, \ldots, n)$  (14)

[0057] but this is computationally difficult, as the problem is invariably singular. A better approach is to note from equation (7) that the coefficients are given by

α_(i)=CL′(1−y_(i)η_(i)−βb)  (15)
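A short Python sketch of recovering the coefficients α_i via equation (15) and evaluating the resulting classifier on a new point; the RBF kernel evaluation and helper names are illustrative, and the homogeneous case (β=0, b=0) is assumed by default for simplicity.

```python
import numpy as np

def recover_alphas(eta, y, C=1.0, beta=0, b=0.0):
    """Equation (15): alpha_i = C * L'(1 - y_i * eta_i - beta * b)."""
    return C * huber_hinge_deriv(1.0 - y * eta - beta * b)

def classify(x_new, X_train, y, alphas, gamma=1.0, beta=0, b=0.0):
    """Decision value y(x) = sum_i y_i alpha_i k(x_i, x) + beta*b for one new point."""
    k_vals = np.exp(-gamma * np.sum((X_train - x_new) ** 2, axis=1))
    return np.sum(y * alphas * k_vals) + beta * b
```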

[0058] The training processes described above can also be extended for use in establishing an SVM for data regression, more precisely, ε-insensitive regression as discussed in Vapnik. Given labelled training data (x₁,y₁), . . . , (x_(n),y_(n))εR^(m)×R, analogous to equations (3) and (4) the primal problem for regression is to minimise
$\Psi(w) = \frac{1}{2} w \cdot w + C \sum_{i=1}^{n} \tilde{L}(\xi_i)$  (16)

[0059] subject to

|y_(i)−w·x_(i)−βb|≦ε+ξ_(i) and ξ_(i)≧0 for i=1, . . . , n,  (17)

[0060] where C, ε>0 are free parameters and L is the loss function as before. This problem is equivalent to minimisation of the following function
$\Psi(w) = \frac{1}{2} w \cdot w + C \sum_{i=1}^{n} L\left(|y_i - w \cdot x_i - \beta b| - \varepsilon\right)$  (18)

[0061] analogous to equation (5) for classification, where as before we define the loss $L(\chi) = \tilde{L}(\max(0, \chi))$. Further, in a similar manner to equations (7) and (8), for the linear case the gradient descent process for regression takes the form
$w^{t+1} = w^t - \delta\left(w^t - C \sum_{i=1}^{n} L'\left(|y_i - w^t \cdot x_i - \beta b| - \varepsilon\right) \operatorname{sgn}\left(y_i - w^t \cdot x_i - \beta b\right) x_i\right),$  (19)
$b^{t+1} = b^t + \delta \beta C \sum_{i=1}^{n} L'\left(|y_i - w^t \cdot x_i - \beta b| - \varepsilon\right) \operatorname{sgn}\left(y_i - w^t \cdot x_i - \beta b\right)$

[0062] and the fixed point algorithm for regression becomes:
$w^{t+1} = C \sum_{i=1}^{n} L'\left(|y_i - w^t \cdot x_i - \beta b| - \varepsilon\right) \operatorname{sgn}\left(y_i - w^t \cdot x_i - \beta b\right) x_i$  (20)

[0063] The above training process can therefore be used to determine a regression function 40, as shown in FIG. 4, for the linear case, where the deviation is defined as ξ_(i)=|y_(i)−(w·x_(i)+βb)|−ε. This is for (ε-insensitive) regression.
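A Python sketch of the linear ε-insensitive regression updates of equations (19) and (20), reusing the smoothed loss derivative sketched earlier; the step size, ε and iteration counts are illustrative, and the fixed-point branch assumes the homogeneous case (β=0).

```python
import numpy as np

def train_linear_svr(X, y, C=1.0, eps=0.1, beta=1, delta_step=0.01,
                     n_iter=1000, fixed_point=False):
    """Primal training for epsilon-insensitive regression, equations (19)/(20)."""
    n, m = X.shape
    w, b = np.zeros(m), 0.0
    for _ in range(n_iter):
        resid = y - X @ w - beta * b                       # y_i - w.x_i - beta*b
        lp = huber_hinge_deriv(np.abs(resid) - eps)        # L'(|resid| - eps)
        s = np.sign(resid)
        if fixed_point:                                    # equation (20), beta = 0 assumed
            w = C * (X.T @ (lp * s))
        else:                                              # equation (19)
            w = w - delta_step * (w - C * (X.T @ (lp * s)))
            b = b + delta_step * beta * C * np.sum(lp * s)
    return w, b
```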

[0064] The iterative processes (19) and (20) can also be extended to the non-linear (kernel) case to provide a regression function 50, as shown in FIG. 5, defining the optimal hypersurface, to give the kernel version of the gradient descent process for regression:
$\eta_j^{t+1} = \eta_j^t - \delta\left(\eta_j^t - C \sum_{i=1}^{n} L'\left(|y_i - \eta_i^t - \beta b| - \varepsilon\right) \operatorname{sgn}\left(y_i - \eta_i^t - \beta b\right) k(\tilde{x}_i, \tilde{x}_j)\right)$
$b^{t+1} = b^t + \delta \beta C \sum_{i=1}^{n} L'\left(|y_i - \eta_i^t - \beta b| - \varepsilon\right) \operatorname{sgn}\left(y_i - \eta_i^t - \beta b\right)$

[0065] and the kernel version of the fixed point algorithm for regression (β=0):
$\eta_j^{t+1} = C \sum_{i=1}^{n} L'\left(|y_i - \eta_i^t| - \varepsilon\right) \operatorname{sgn}\left(y_i - \eta_i^t\right) k(\tilde{x}_i, \tilde{x}_j)$

[0066] Having derived the optimal values η_(j) (j=1, . . . , n) and b, i.e. the fixed point (η₁, . . . , η_(n)), from one of the above iterative processes, the optimal SVM regressor function 50 is defined by
$y(\tilde{x}) = \sum_{i=1}^{n} \beta_i k(\tilde{x}, \tilde{x}_i) + \beta b$

[0067] where the coefficients β_(i) (Lagrange multipliers) are derived from the following equation, which is analogous to equation (15)

β_(i)=CL′(|y_(i)−η_(i)−βb|−ε)sgn(y_(i)−η_(i)−βb)
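A brief Python sketch of the kernel ε-insensitive regression fixed point (β=0) together with recovery of the coefficients β_i, reusing the illustrative RBF kernel and smoothed loss derivative from the earlier sketches.

```python
import numpy as np

def train_kernel_svr(X, y, C=1.0, eps=0.1, gamma=1.0, n_iter=300):
    """Kernel fixed-point iteration for epsilon-insensitive regression (beta = 0)."""
    K = rbf_kernel_matrix(X, gamma)
    eta = np.zeros(len(y))
    for _ in range(n_iter):
        resid = y - eta
        lp = huber_hinge_deriv(np.abs(resid) - eps)     # L'(|y_i - eta_i| - eps)
        eta = C * (K @ (lp * np.sign(resid)))
    betas = C * huber_hinge_deriv(np.abs(y - eta) - eps) * np.sign(y - eta)
    return betas, K

# Prediction for a new point x: y(x) = sum_i betas[i] * k(x, x_i).
```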

[0068] The above techniques are also applicable to another class of learning machine algorithms, referred to as regularisation networks (RNs), as discussed in G. Kimeldorf and G. Wahba, A correspondence between Bayesian estimation of stochastic processes and smoothing by splines, Ann. Math. Statist., 1970, 495-502; F. Girosi, M. Jones and T. Poggio, Regularization Theory and Neural Networks Architectures, Neural Computation, 1995, 219-269; and G. Wahba, Support Vector Machines, Reproducing Kernel Hilbert Spaces and the Randomized GACV, in B. Scholkopf, C. J. Burges and A. Smola, eds., Advances in Kernel Methods - Support Vector Learning, MIT Press, Cambridge, USA, 1998, pp 69-88. The following extends the previous processes to this class of learning machines, given labelled training data (x₁,y₁), . . . , (x_(n),y_(n))εR^(m)×R. Analogous to equations (3) and (4), an RN is defined as the minimiser of the (unconstrained) regularised risk
$\tilde{\Psi}(w) = \frac{\lambda}{2} w \cdot w + \sum_{i=1}^{n} L(\xi_i) = \frac{\lambda}{2} w \cdot w + \sum_{i=1}^{n} L\left(y_i - w \cdot x_i - \beta b\right)$

[0069] where λ>0 is a free parameter (regularisation constant) and L is the convex loss function, e.g. L(ξ)=ξ^(p) for p≧1. This problem is equivalent to minimisation of the following functional
$\Psi(w,b) = \frac{1}{2} w \cdot w + C \sum_{i=1}^{n} L\left(y_i - w \cdot x_i - \beta b\right)$

[0070] under the assumption λ=C⁻¹. The latter functional has the form of equation (16), and techniques analogous to those described above can be employed to find its minimum. Analogous to equation (19), in the linear case, the gradient descent algorithm for RN takes the form
$w^{t+1} = w^t - \delta \nabla_w \Psi = w^t - \delta\left(w^t - C \sum_{i=1}^{n} L'\left(y_i - w^t \cdot x_i - \beta b\right) x_i\right),$
$b^{t+1} = b^t - \delta \nabla_b \Psi = b^t + \delta \beta C \sum_{i=1}^{n} L'\left(y_i - w^t \cdot x_i - \beta b\right)$

[0071] and the fixed point algorithms for RN becomes:

w ^(t+1) =CΣ _(i=1) ^(n) L′(y _(i)=w^(t) ·x _(i) −βb)x_(i)

[0072] Those two algorithms extended to the non-linear (kernel) case yield the kernel version of the gradient descent algorithm for RN:
$\eta_j^{t+1} = \eta_j^t - \delta\left(\eta_j^t - C \sum_{i=1}^{n} L'\left(y_i - \eta_i^t - \beta b\right) k(\tilde{x}_i, \tilde{x}_j)\right)$
$b^{t+1} = b^t + \delta \beta C \sum_{i=1}^{n} L'\left(y_i - \eta_i^t - \beta b\right)$

[0073] and the kernel version of the fixed point algorithm for RN (β=0):
$\eta_j^{t+1} = C \sum_{i=1}^{n} L'\left(y_i - \eta_i^t - \beta b\right) k(\tilde{x}_i, \tilde{x}_j)$

[0074] Having found the optimal values η_(j) (j=1, . . . , n) from the above algorithms, the optimal regressor is defined as
$y(\tilde{x}) = \sum_{i=1}^{n} \beta_i k(\tilde{x}, \tilde{x}_i) + \beta b$

[0075] where the coefficients (Lagrange multipliers) β_(i) are derived from the following equation, analogous to equation (15)

β_(i)=CL′(y_(i)−η_(i)−βb)
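A compact Python sketch of the kernel RN fixed point (β=0) with the squared loss L(ξ)=ξ², for which L′(ξ)=2ξ; the names and parameters are illustrative, and the kernel matrix helper is the one sketched earlier. As with the other fixed-point iterations, convergence depends on C and the kernel.

```python
import numpy as np

def train_kernel_rn(X, y, C=1.0, gamma=1.0, n_iter=300):
    """Kernel regularisation network fixed point (beta = 0), squared loss L(xi) = xi**2."""
    K = rbf_kernel_matrix(X, gamma)
    eta = np.zeros(len(y))
    for _ in range(n_iter):
        eta = C * (K @ (2.0 * (y - eta)))        # L'(y_i - eta_i) = 2*(y_i - eta_i)
    betas = C * 2.0 * (y - eta)                  # beta_i = C * L'(y_i - eta_i)
    return betas, K
```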

[0076] One example of the many possible applications for the SVM is to use the SVM to effectively filter unwanted email messages or "Spam". In any given organisation, a large amount of email messages are received and it is particularly advantageous to be able to remove those messages which are unsolicited or which the organisation clearly does not want its personnel to receive. Using the fast training processes described above, which are able to operate on large data sets of multiple dimensions, several to several hundred emails can be processed to establish an SVM which is able to classify emails as either being bad or good.

[0077] The training data set includes all of the text of the email messages, and each word or phrase in a preselected dictionary can be considered to constitute a dimension of the vectors.

[0078] For instance, if D={phrase₁, . . . , phrase_(m)} is a preselected dictionary of words and phrases to be looked for, then with each email E an m-dimensional vector of frequencies can be associated

x=x(E)=(freq₁(E), . . . , freq_(m)(E))

[0079] where freq_(i)(E) gives the number of times (frequency) the phrase phrase_(i) appears in the email E. In the classification phase the likelihood of email E being Spam is estimated as
$y(E) = w \cdot x = \sum_{j=1}^{m} w_j\, freq_j(E)$

[0080] where the vector w=(w₁, . . . , w_(m)) defining the decision surface is obtained using the training process of equation (7) or (8) for the sequence of training email vectors x_(i)=(freq₁(E_(i)), . . . , freq_(m)(E_(i))), each associated with the training label y_(i)=1 for an example of a Spam email and y_(i)=−1 for each allowed email, i=1, . . . , n.
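A Python sketch of this Spam-filtering pipeline: building phrase-frequency vectors from a small illustrative dictionary and reusing the linear primal training sketched earlier. The dictionary, emails and labels shown here are made-up examples for illustration only.

```python
import numpy as np

DICTIONARY = ["free offer", "viagra", "meeting", "invoice"]   # illustrative phrases only

def email_to_vector(text, dictionary=DICTIONARY):
    """Frequency vector x(E) = (freq_1(E), ..., freq_m(E)) over the phrase dictionary."""
    text = text.lower()
    return np.array([text.count(phrase) for phrase in dictionary], dtype=float)

# Illustrative training set: label +1 for a Spam email, -1 for an allowed email.
emails = ["Free offer inside, free offer now!", "Agenda for the project meeting",
          "Your invoice is attached", "viagra free offer"]
labels = np.array([1.0, -1.0, -1.0, 1.0])
X = np.vstack([email_to_vector(e) for e in emails])

w, b = train_linear_svm_gd(X, labels, C=1.0, beta=1)           # equation (7) sketch from above
score = w @ email_to_vector("limited free offer") + b           # decision value w.x + b (beta = 1)
# A positive score suggests the email is likely Spam.
```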

[0081] Other applications for the SVM include continuous speech recognition, image classification, particle identification for high energy physics, object detection, combustion engine knock detection, detection of remote protein homologies, 3D object recognition, text categorisation (as discussed above), time series prediction and reconstruction for chaotic systems, handwritten digit recognition, breast cancer diagnosis and prognosis based on breast cancer data sets, and decision tree methods for database marketing.

[0082] Many modifications will be apparent to those skilled in the art without departing from the scope of the present invention as herein described with reference to the accompanying drawings.

1. A training method for a support vector machine, including executing an iterative process on a training set of data to determine parameters defining said machine, said iterative process being executed on the basis of a differentiable form of a primal optimisation problem for said parameters, said problem being defined on the basis of said parameters and said data set.
2. A method as claimed in claim 1, wherein said method is adapted for generation of a kernel learning machine.
3. A method as claimed in claim 1, wherein said method is adapted to generate a regularisation network.
4. A method as claimed in claim 1, wherein for classification, and for the SVM, y=sgn(w·x+βb), where y is the output, x is the input data, β is 0 or 1, and the vector w and bias b, being parameters defining a decision surface, are obtained by minimising the differentiable objective function:
$\Psi(w,b) = \frac{1}{2} w \cdot w + C \sum_{i=1}^{n} L\left(1 - y_i(w \cdot x_i + \beta b)\right)$

where C>0 is a free parameter, x_(i), i=1, . . . , n, are data points of the training set, y_(i)=±1, i=1, . . . , n, are known labels, n is the number of data points and L is a differentiable loss function such that L(ξ)=0 for ξ≦0.
5. A method as claimed in claim 4, wherein said iterative process operates on a derivative of the objective function Ψ until the vectors converge to a vector w for the machine.
6. A method as claimed in claim 1, wherein for ε-insensitive regression, the vector w and bias b, being parameters defining a decision surface, are obtained by minimising the differentiable objective function
$\Psi(w,b) = \frac{1}{2} w \cdot w + C \sum_{i=1}^{n} L\left(|y_i - w \cdot x_i - \beta b| - \varepsilon\right)$

where ε>0 is a free parameter, C>0 is a free parameter, β is 0 or 1, x_(i), i=1, . . . , n, are the training data points of the data set, y_(i), i=1, . . . , n, are the known labels, n is the number of data points and L is a differentiable loss function such that L(ξ)=0 for ξ≦0.
 7. Asupport vector machine for a classification task having an output ygiven by$y = {{y(x)} = {{\sum\limits_{i = 1}^{n}{y_{i}\alpha_{i}{k\left( {x_{i},x_{j}} \right)}}} + {\beta \quad b}}}$

where xεR^(m) is a data point to be classified and x_(i) are training data points, k is a kernel function, and α_(i) are coefficients determined by α_(i)=CL′(1−y_(i)η_(i)−βb), where L′(ξ) is the derivative of the loss and the values η_(i) are determined by iteratively executing
$\eta_j^{t+1} = \eta_j^t - \delta\left(\eta_j^t - C \sum_{i=1}^{n} L'\left(1 - y_i \eta_i^t - y_i \beta b^t\right) y_i k(x_i, x_j)\right),$
$b^{t+1} = \beta b^t + \delta \beta C \sum_{i=1}^{n} L'\left(1 - y_i \eta_i^t - y_i \beta b^t\right) y_i.$

where δ>0 is a free parameter representing a learning rate and/or, by iteratively executing in the homogeneous case (β=0):
$\eta_j^{t+1} = C \sum_{i=1}^{n} L'\left(1 - y_i \eta_i^t\right) y_i k(x_i, x_j).$

where i, j=1, . . . , n, n being the number of data points, t represents an iteration and L′ is the derivative of a loss function L.
8. A support vector machine for ε-regression having output y given by
$y(x) = \sum_{i=1}^{n} \beta_i k(x, x_i) + \beta b$

where xεR^(m) is a data point to be evaluated and x_(i) are training data points, k is a kernel function, β=0 or 1, and β_(i) and bias b are coefficients determined by β_(i)=CL′(|y_(i)−η_(i)−βb|−ε)sgn(y_(i)−η_(i)−βb), where ε is a free parameter and the values η_(i) and b are determined by iteratively executing
$\eta_j^{t+1} = \eta_j^t - \delta\left(\eta_j^t - C \sum_{i=1}^{n} L'\left(|y_i - \eta_i^t - \beta b| - \varepsilon\right) \operatorname{sgn}\left(y_i - \eta_i^t - \beta b\right) k(x_i, x_j)\right)$
$b^{t+1} = b^t + \delta \beta C \sum_{i=1}^{n} L'\left(|y_i - \eta_i^t - \beta b| - \varepsilon\right) \operatorname{sgn}\left(y_i - \eta_i^t - \beta b\right)$

where δ>0 is a free parameter representing a learning rate and/or, by iteratively executing in the homogeneous case (β=0):
$\eta_j^{t+1} = C \sum_{i=1}^{n} L'\left(|y_i - \eta_i^t| - \varepsilon\right) \operatorname{sgn}\left(y_i - \eta_i^t\right) k(x_i, x_j).$

where i, j=1, . . . , n, n being the number of data points, t represents an iteration and L′ is the derivative of a loss function L.

9. A regularisation network having output y given by
$y(x) = \sum_{i=1}^{n} \beta_i k(x, x_i) + \beta b$

where xεR^(m) is a data point to be evaluated and x_(i) are training data points, k is a kernel function, β=0 or 1, and β_(i) and bias b are coefficients determined by β_(i)=CL′(y_(i)−η_(i)−βb), where the values η_(i) and b are determined by iteratively executing
$\eta_j^{t+1} = \eta_j^t - \delta\left(\eta_j^t - C \sum_{i=1}^{n} L'\left(y_i - \eta_i^t - \beta b\right) k(x_i, x_j)\right)$
$b^{t+1} = b^t + \delta \beta C \sum_{i=1}^{n} L'\left(y_i - \eta_i^t - \beta b\right)$

where δ>0 is a free parameter representing a learning rate and/or, by iteratively executing in the homogeneous case (β=0):
$\eta_j^{t+1} = C \sum_{i=1}^{n} L'\left(y_i - \eta_i^t - \beta b\right) k(x_i, x_j)$

where i, j=1, . . . , n, n being the number of data points, t represents an iteration and L′ is the derivative of a loss function L.

10. A method of generating a support vector machine with a differentiable penalty by direct minimisation of a primal problem.
11. A method as claimed in claim 10, wherein said minimisation is executed using unconstrained optimisation.
12. A method as claimed in claim 10, including executing said minimisation using a gradient descent method, with a step size δ determined by minimisation of a 1-dimensional restriction of an objective function in relation to the direction of the gradient of the function.
13. A method as claimed in claim 10, including executing said minimisation using a gradient descent method, with a step size δ determined by minimisation of a 1-dimensional restriction of an approximation of an objective function in relation to the direction of the gradient of the function.
14. A method of generating a support vector machine with a linear penalty by executing a process using a differentiable loss that is a smoothed approximation of the linear loss L(ξ)=max(0, ξ).
15. A method of generating a decision surface for a learning machine by iteratively executing a solution for a vector representing said decision surface using a training data set, said solution being based on a gradient of a differentiable form of an optimisation function for said surface.
16. Computer software having code for executing the steps of a method as claimed in any one of claims 1 to 6 and 10 to 15.